Abstract

An importance weight quantifies the rela- 
tive importance of one example over another, 
coming up in applications of boosting, asym- 
metric classification costs, reductions, and 
active learning. The standard approach for 
dealing with importance weights in gradient 
descent is via multiplication of the gradient. 
We first demonstrate the problems of this ap- 
proach when importance weights are large, 
and argue in favor of more sophisticated ways 
for dealing with them. We then develop an 
approach which enjoys an invariance prop- 
erty: that updating twice with importance 
weight h is equivalent to updating once with 
importance weight 2h. For many important 
losses this has a closed form update which 
satisfies standard regret guarantees when all 
examples have h = 1. We also briefly dis- 
cuss two other reasonable approaches for han- 
dling large importance weights. Empirically, 
these approaches yield substantially superior 
prediction with similar computational perfor- 
mance while reducing the sensitivity of the 
algorithm to the exact setting of the learning 
rate. We apply these to online active learning 
yielding an extraordinarily fast active learn- 
ing algorithm that works even in the presence 
of adversarial noise. 



1 INTRODUCTION 

Importance weights appear in boosting algorithms [S] 
which assign a weight to each example depending on 
how well this point has been classified in previous it- 
erations, covariate shift algorithms which assign a 
weight to a training example according to how close 
to the test distribution the example is, and active
learning algorithms [U [5] where an adaptive rejec- 
tion sampling scheme is applied to each example and 
each retained example gets an importance equal to 
the inverse probability of being retained. Importance 
weights have become a de-facto language for specifying 
the relative importance of prediction amongst exam- 
ples. 

When not concerned by computational constraints, 
importance weights can be dealt with using either 
black box techniques [T8j |20] or direct modification of 
existing algorithms such that an existing example with 
importance weight h is treated as h examples. How- 
ever, when computational constraints are significant 
online gradient descent based algorithms are preferred. 
Here the standard approach of treating an example 
with importance weight h as h examples is typically 
translated into practice via multiplying the gradient 
by h. This is undesirable for large h because such an 
example can cause an update that's far beyond what's 
necessary to attain a small loss on it. 

An important observation is that multiplying the gra- 
dient by h is typically not equivalent to doing h up- 
dates via gradient descent because all loss functions 
of interest are nonlinear. The goal of this paper is 
resolving this translation failure by investigating al- 
ternate updates that gracefully deal with importance 
weights, by taking into account the curvature of the 
loss. Among these updates we mainly focus on a novel 
set of updates that satisfies an additional invariance 
property: for all importance weights of h, the update 
is equivalent to two updates with importance weight 
h/2. We call these updates importance invariant. 

Even though the importance invariant updates will be 
defined via an ordinary differential equation (ODE), 
we were surprised to find that they are closed-form 
for all common loss functions. We were also surprised 
to discover that the importance weight invariant up- 
date substantially improves the learned predictor even 
when h = 1, both in terms of the quality of best pre- 



dictor after a parameter search and in terms of the ro- 
bustness to parameter search, effectively reducing the 
desirability of searching over some schedule of learn- 
ing rates. Upon inspection, the reason for this is that 
an importance weight invariant update smoothly in- 
terpolates between a very aggressive projection pi^llS] 
algorithm and a less aggressive gradient multiplier de- 
cay algorithm. All of these benefits come at near-zero 
computational cost. 

Among the other algorithms we consider, implicit updates [13, 14] turn out to coincide with importance
invariant ones for piecewise linear losses and provide 
qualitatively similar updates for other losses, implying 
that our derivation is an alternative way to motivate 
this style of algorithm. For most other loss functions 
implicit updates require a root-finding algorithm. 

Finally, another reasonable way to handle importance 
is related to [6], who analyze it for the logistic and 
exponential losses. Here, a second order Taylor ex- 
pansion at the current prediction and in the direction 
of the update is used to approximate the loss. These 
updates coincide with implicit updates for squared loss 
(since the quadratic approximation is exact) and are 
not applicable to piecewise linear losses. We won't 
discuss these updates any further since they have only 
been analyzed for very specific loss functions. 

In Section 2 we define the problem and describe some obvious but unsatisfactory approaches. Next, in Section 3, we propose the importance invariant solution and present a general framework for deriving importance invariant updates for many loss functions. We subsequently discuss some important properties of the proposed updates, such as safety, in Section 4. Section 5 briefly covers how implicit updates can handle importance weights. In Section 6 we empirically demonstrate the merits of not linearizing the loss on problems with and without importance weights. Section 7 states our conclusions.

2 PROBLEM SETTING 

We assume access to a training set of triplets $(x_t, y_t, h_t)$, $t = 1, \ldots, T$, where $x_t \in \mathbb{R}^d$ is a vector of $d$ features, $h_t \in \mathbb{R}_+$ is an importance weight, and $y_t \in \mathbb{R}$ is a label. We are also given a loss function $\ell(p, y)$ where $p$ is the prediction of our model and $y$ is the actual label. Depending on the loss function, $y$ may take values in a restricted set, such as $\{-1, +1\}$ or $\{0, 1\}$. In this paper we focus on linear models, i.e. $p = w^\top x$ where $w \in \mathbb{R}^d$ is a vector of weights. Our goal is to find

$$w = \operatorname*{argmin}_{w} \sum_{t=1}^{T} h_t\, \ell(w^\top x_t, y_t), \qquad (1)$$



Algorithm 1 Online Gradient Descent
  $w_1 = 0$
  for $t = 1$ to $T$ do
    $w_{t+1} = w_t - \eta_t \nabla_w \ell(w_t^\top x_t, y_t)$
  done



using online gradient descent. When examples do not have importance weights the online gradient descent algorithm is shown in Algorithm 1. The notation $\nabla_w \ell(w_t^\top x_t, y_t)$ means the gradient of the loss with respect to $w$ evaluated at the $t$-th prediction and label.
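As an illustration of Algorithm 1, the following is a minimal sketch for a linear model with squared loss. The function names (`dot`, `squared_loss_dp`, `online_gradient_descent`) and the constant learning rate `eta` are assumptions of this sketch, not part of the original algorithm listing.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def squared_loss_dp(p, y):
    """Derivative of 0.5 * (p - y)**2 with respect to the prediction p."""
    return p - y


def online_gradient_descent(examples, dim, eta, loss_dp=squared_loss_dp):
    """One pass of online gradient descent over (x, y) pairs, starting from w = 0.

    A constant learning rate eta is assumed here for simplicity.
    """
    w = [0.0] * dim
    for x, y in examples:
        g = loss_dp(dot(w, x), y)  # scalar dl/dp at the current prediction
        # For a linear model the gradient with respect to w is g * x.
        w = [wi - eta * g * xi for wi, xi in zip(w, x)]
    return w
```

Any other loss can be plugged in through `loss_dp`, since for a linear model the gradient only enters through the scalar derivative of the loss with respect to the prediction.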

When examples have importance weights, we would like to adhere to the following principle: An example with importance weight $h$ should be treated as if it is a regular example that appears $h$ times in the dataset. This is a statement of both mathematical and semantic correctness. Mathematically, (1) states exactly the same thing. Semantically, an example of importance $h$ is just a convenient encoding of $h$ identical examples. For now we assume importance weights are integers and the learning rate sequence is constant, $\eta_t = \eta$. These assumptions are only for ease of exposition and are lifted in Section 3.

2.1 Some Unsatisfactory Approaches 

A first approach would be to loop through the data 
enough epochs, with epoch i using only those exam- 
ples whose importance is greater than i. While this 
is a valid approach, it is very inefficient. Ideally each 
example should be presented once to the learner. 

Another tempting approach is multiplying the update by the importance weight:

$$w_{t+1} = w_t - h_t \eta \nabla_w \ell(w_t^\top x_t, y_t).$$

However, this update rule does not respect the principle of the previous section. To see this, consider the case $h_t = 2$. The above rule should be equivalent to

$$v = w_t - \eta \nabla_w \ell(w_t^\top x_t, y_t)$$
$$w_{t+1} = v - \eta \nabla_w \ell(v^\top x_t, y_t)$$

which is not true in general. Furthermore, the quality of this update gets worse as the importance weight gets larger, since the first order approximation of the loss is invalid far away from its expansion point.
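The mismatch between multiplying the gradient and repeating the update can be seen on a one-dimensional squared-loss example. The sketch below is illustrative only; the helper names and the particular numbers (`eta = 0.4`, label 1, starting weight 0) are assumptions chosen to make the effect visible.

```python
def sgd_step(w, x, y, eta, h=1.0):
    """Squared-loss gradient step with the gradient multiplied by h."""
    p = sum(wi * xi for wi, xi in zip(w, x))
    return [wi - h * eta * (p - y) * xi for wi, xi in zip(w, x)]


def repeated_steps(w, x, y, eta, h):
    """Present the same example h times in a row (h is an integer here)."""
    for _ in range(h):
        w = sgd_step(w, x, y, eta)
    return w


w0, x, y, eta = [0.0], [1.0], 1.0, 0.4
multiplied = sgd_step(w0, x, y, eta, h=10)[0]    # jumps far beyond the label
repeated = repeated_steps(w0, x, y, eta, 10)[0]  # approaches the label from below
```

With these numbers the multiplied update lands at a prediction of 4 for a label of 1, while ten ordinary updates stay just below the label.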

Another approach with good computational characteristics is rejection sampling according to $h / h_{\max}$. However, this approach generally decreases performance due to throwing out samples. Rejection sampling can
be repaired by learning multiple predictors based upon 
different rejection sampled datasets [20], but this of 
course increases computation substantially. 



2.2 An Efficient Invariant Approach 

To achieve invariance and efficiency we will focus on
the cumulative effect of presenting an example h times 
in a row. This scheme respects our correctness prin- 
ciple and considers each example once. The only re- 
maining question is whether the cumulative effect of 
h presentations in a row can be computed faster than 
explicitly doing so. In the next section we explain why 
this is possible and how to compute it. 

3 A FRAMEWORK FOR 
DERIVING STEP SIZES 

Given a loss of the form $\ell(p, y)$ where $p$ is the prediction and assuming a linear model $p = w^\top x$ we have that $\nabla_w \ell = \frac{\partial \ell}{\partial p} x$. Therefore all gradients of a given
example point to the same direction and only differ 
in magnitude. Hence computing the cumulative effect 
of presenting the example h times in a row amounts 
to computing a global scaling for x that aggregates 
the effects of all the gradients. We begin with a sim- 
ple lemma that formalizes this in the case of integer 
importance weights. 

Lemma 1. Let $h \in \mathbb{N}$. Presenting example $(x, y)$ $h$ times in a row is equivalent to the update

$$w_{t+1} = w_t - s(h)x \qquad (2)$$

where the scaling factor $s(h)$ has this recursive form:

$$s(h+1) = s(h) + \eta \left.\frac{\partial \ell}{\partial p}\right|_{p = (w_t - s(h)x)^\top x} \qquad (3)$$
$$s(0) = 0 \qquad (4)$$



Proof. By induction on $h$. The base case is obvious. Now the effect of presenting the example $(x, y)$ $h+1$ times can be computed by performing a gradient update on the vector $v$ that results from presenting the example $h$ times. By the induction hypothesis this intermediate vector is

$$v = w_t - s(h)x$$

and the gradient descent step is

$$w_{t+1} = v - \eta \nabla_w \ell(w^\top x, y)\big|_{w = v}.$$

Expanding this using the induction hypothesis we get

$$w_{t+1} = w_t - s(h)x - \eta \left.\frac{\partial \ell}{\partial p}\right|_{p = v^\top x} x = w_t - \left( s(h) + \eta \left.\frac{\partial \ell}{\partial p}\right|_{p = (w_t - s(h)x)^\top x} \right) x = w_t - s(h+1)x.$$



Given a loss function, one could try to find a closed form solution to the recurrence defined by (3), (4). For example, for squared loss $\ell(p, y) = \frac{1}{2}(p - y)^2$ the recurrence is

$$s(h+1) = s(h) + \eta\left((w_t - s(h)x)^\top x - y\right).$$

A simple inductive argument can then verify that

$$s(h) = \frac{w_t^\top x - y}{x^\top x}\left(1 - (1 - \eta x^\top x)^h\right). \qquad (5)$$

Note that when $\eta x^\top x < 1$, $s(h)$ asymptotes to the quantity that would make $w_{t+1}^\top x = y$. This behavior is more desirable than that of multiplying the gradient with the importance weight.
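The closed form for the squared-loss recurrence can be checked numerically. The sketch below iterates the recurrence for integer $h$ and compares it with the closed-form expression; the function names and the scalar arguments `w_dot_x` and `x_dot_x` (standing for $w_t^\top x$ and $x^\top x$) are conventions of this sketch.

```python
def s_recursive(h, eta, w_dot_x, y, x_dot_x):
    """Iterate s(h+1) = s(h) + eta * ((w_t - s(h) x)'x - y), starting from s(0) = 0."""
    s = 0.0
    for _ in range(h):
        s += eta * ((w_dot_x - s * x_dot_x) - y)
    return s


def s_closed_form(h, eta, w_dot_x, y, x_dot_x):
    """Closed form (w_t'x - y) / x'x * (1 - (1 - eta x'x)^h) for integer h."""
    return (w_dot_x - y) / x_dot_x * (1.0 - (1.0 - eta * x_dot_x) ** h)
```

Both functions only need two inner products, so the closed form replaces $h$ gradient steps with a constant amount of work.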

Notwithstanding the significance of such an update, it is restricted to integer importance weights. Moreover, other loss functions do not yield a recurrence with a closed form solution. To overcome these problems, we use (5) as a starting point to think about the consequences of presenting an example many times. To compensate, we will also need to adjust the learning rate we use every time we present an example. Suppose that we present an example a factor of $n$ times more using a learning rate that is smaller by a factor of $n$. This can be simulated in constant time using (5) with $hn$ and $\frac{\eta}{n}$ in place of $h$ and $\eta$ respectively. Letting $n$ grow large, we are interested in

$$\lim_{n \to \infty} \frac{w_t^\top x - y}{x^\top x}\left(1 - \left(1 - \frac{\eta x^\top x}{n}\right)^{hn}\right).$$

Using that $\lim_{n \to \infty}(1 + z/n)^n = e^z$ we have that

$$s(h) = \frac{w_t^\top x - y}{x^\top x}\left(1 - \exp(-h \eta x^\top x)\right). \qquad (6)$$



The key in the above derivation is using the limit of the 
gradient descent process as the learning rate becomes 
infinitesimal. We now generalize this idea to derive 
updates for other loss functions. 
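For squared loss, the limiting argument can be illustrated numerically: presenting the example $hn$ times with learning rate $\eta / n$ approaches the exponential closed form as $n$ grows, including for non-integer $h$. This is a sketch with hypothetical function names; `w_dot_x` and `x_dot_x` stand for $w_t^\top x$ and $x^\top x$.

```python
import math


def s_invariant_squared(h, eta, w_dot_x, y, x_dot_x):
    """Importance invariant scaling for squared loss (limit of infinitesimal steps)."""
    return (w_dot_x - y) / x_dot_x * (1.0 - math.exp(-h * eta * x_dot_x))


def s_refined(h, eta, w_dot_x, y, x_dot_x, n):
    """Discrete closed form with h * n presentations at learning rate eta / n."""
    return (w_dot_x - y) / x_dot_x * (1.0 - (1.0 - eta * x_dot_x / n) ** (h * n))


def updated_residual(h, eta, w_dot_x, y, x_dot_x):
    """Residual w_{t+1}'x - y after the invariant update."""
    s = s_invariant_squared(h, eta, w_dot_x, y, x_dot_x)
    return (w_dot_x - s * x_dot_x) - y
```

The updated residual equals the old residual times $\exp(-h \eta x^\top x)$, so it shrinks toward zero as $h$ grows but never changes sign.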

Theorem 1. The limit of the gradient descent process as the learning rate becomes infinitesimal for an example with importance weight $h \in \mathbb{R}_+$ is equal to the update

$$w_{t+1} = w_t - s(h)x$$

where the scaling factor $s(h)$ satisfies the differential equation:

$$s'(h) = \eta \left.\frac{\partial \ell}{\partial p}\right|_{p = (w_t - s(h)x)^\top x}, \qquad s(0) = 0. \qquad (7)$$



□ 



The proof is in the appendix. This theorem is our framework for deriving updates for many loss functions. Plugging a loss function in (7) gives an ODE



whose solution is the result of a continuous gradient 
descent process. The ODE can be easily solved by 
separation of variables. 

As a sanity check for squared loss, $\frac{\partial \ell}{\partial p} = p - y$, and (7) gives

$$s'(h) = \eta\left((w_t - s(h)x)^\top x - y\right), \qquad s(0) = 0,$$

a linear ODE, whose solution exactly rederives (6).

3.1 Other Loss Functions 

Using (7) as our framework, we can derive step sizes for many popular loss functions, as summarized in Table 1.

For the logistic loss, the solution involves the Lambert W function, defined by $W(z)e^{W(z)} = z$, and the solution can be verified using $W'(z) = \frac{W(z)}{z(1 + W(z))}$. The exponential loss also fits nicely into our framework.

For the logarithmic loss the ODE has no explicit form for all $y \in [0, 1]$. The table presents the common case $y \in \{0, 1\}$. In this case each value of $y$ gives rise to an ODE whose solution has an explicit form. Note that here the ODE solutions satisfy a second degree equation and hence each branch has two solutions. We have selected the one satisfying $s'(0) = \eta \frac{\partial \ell}{\partial p}$. To avoid an infinite loss, we can clip the predictions away from 0 and 1 (and update using $\min(h, h')$ where $h'$ is the importance weight that hits the clipping point) or use a link function such as tanh. In the appendix we show updates for this latter case.

A similar situation arises for the Hellinger loss. The solution to (7) has no simple form for all $y \in [0, 1]$ but for $y \in \{0, 1\}$ we get the expressions in Table 1.
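For the logistic loss $\log(1 + e^{-yp})$ with $y \in \{-1, 1\}$, separation of variables leads to the scalar equation $u + e^{u} = yp + e^{yp} + h \eta x^\top x$, where $u$ is $y$ times the updated prediction and $e^{u}$ is the Lambert W value. The sketch below solves that equation by bisection instead of calling a Lambert W routine (a choice of this sketch, made to stay within the standard library) and compares the result with a forward-Euler integration of the ODE. All function names are hypothetical.

```python
import math


def logistic_invariant_s(p, y, h, eta, x_dot_x, iters=200):
    """Scaling s(h) for logistic loss, y in {-1, +1}.

    Solves u + exp(u) = y p + exp(y p) + h eta x'x by bisection; the root is
    bracketed by [y p, y p + h eta x'x] because the left side is increasing in u.
    """
    c = y * p + math.exp(y * p) + h * eta * x_dot_x
    lo, hi = y * p, y * p + h * eta * x_dot_x
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + math.exp(mid) < c:
            lo = mid
        else:
            hi = mid
    u = 0.5 * (lo + hi)           # u = y * (updated prediction)
    return (p - y * u) / x_dot_x  # since 1 / y = y for y in {-1, +1}


def logistic_euler_s(p, y, h, eta, x_dot_x, steps=100000):
    """Forward-Euler integration of s'(h) = eta * dl/dp evaluated at p - s * x'x."""
    s, dh = 0.0, h / steps
    for _ in range(steps):
        q = p - s * x_dot_x
        s += dh * eta * (-y / (1.0 + math.exp(y * q)))
    return s
```

The Euler routine is only a reference for checking; the bisection costs a fixed number of scalar operations per example.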

3.1.1 Hinge Loss and Quantile Loss 

Two other commonly used loss functions are the hinge loss and the $\tau$-quantile loss where $\tau \in [0, 1]$ is a parameter. These are differentiable everywhere except at one point where the subdifferential contains zero.

Hence, for the hinge loss, a valid expression for (7) is

$$s'(h) = \begin{cases} -\eta y & y(w - s(h)x)^\top x < 1 \\ 0 & y(w - s(h)x)^\top x \ge 1 \end{cases}$$

The first branch (together with $s(0) = 0$) gives $s(h) = -yh\eta$ for $y(w + yh\eta x)^\top x < 1$. Otherwise, i.e. when $h \ge h_{\text{hinge}} = \frac{1 - y w^\top x}{\eta x^\top x}$, $s(h)$ is a constant. Here $h_{\text{hinge}}$ is the importance weight that would make the updated prediction lie at the hinge. To maintain continuity at $h_{\text{hinge}}$ we set $s(h) = -y h_{\text{hinge}} \eta$. In conclusion

$$s(h) = -y \min(h, h_{\text{hinge}})\, \eta.$$

This matches the intuition when one thinks about the limit of infinitely many infinitely small updates: for large importance weights, the process will bring the prediction up to $y$ and make no further progress.

The quantile loss is similar: the update rule first computes the importance weight $h'$ that would take the updated prediction to the point of nondifferentiability and then multiplies the gradient by $\min(h, h')$.
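The hinge and quantile updates described above can be sketched as follows. The function names are hypothetical, `p` stands for the current prediction $w^\top x$, and the explicit guard for examples already past the hinge is an addition of this sketch that follows the second branch of the ODE. The quantile version assumes $0 < \tau < 1$.

```python
def s_hinge(p, y, h, eta, x_dot_x):
    """Invariant scaling for hinge loss max(0, 1 - y p), y in {-1, +1}."""
    if y * p >= 1.0:
        return 0.0  # already at or beyond the hinge: the derivative is zero
    return -y * min(h * eta, (1.0 - y * p) / x_dot_x)


def s_quantile(p, y, h, eta, x_dot_x, tau):
    """Invariant scaling for the tau-quantile loss, assuming 0 < tau < 1."""
    if y > p:
        return -tau * min(h * eta, (y - p) / (tau * x_dot_x))
    if y < p:
        return (1.0 - tau) * min(h * eta, (p - y) / ((1.0 - tau) * x_dot_x))
    return 0.0


def updated_prediction(p, s, x_dot_x):
    """Prediction after the update w <- w - s x."""
    return p - s * x_dot_x
```

For a very large importance weight both updates stop exactly at the point of nondifferentiability: the hinge update brings $yp$ to 1, and the quantile update brings the prediction to $y$.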

3.2 Variable Learning Rate 

To handle a decaying learning rate $\eta_t$, we just need to slightly modify (7). Let $\eta_t(u)$ be the value of the learning rate $u$ timesteps after time $t$. Then (7) becomes

$$s'(h) = \eta_t(h) \left.\frac{\partial \ell}{\partial p}\right|_{p = (w_t - s(h)x)^\top x}, \qquad s(0) = 0.$$



The solutions in this case are not qualitatively any different from the solutions of (7). We just need to replace the occurrences of $h\eta$ with $\int_0^h \eta_t(u)\,du$. Again, for popular choices of learning rate such as $\eta_t(u) \propto (t+u)^{-p}$ with $p = \frac{1}{2}$ or $p = 1$, this has a closed form.
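As a worked example, assume a polynomially decaying schedule of the form $\eta_t(u) = \eta\,(t+u)^{-p}$ with $t \ge 1$. This specific form and the constant $\eta$ are assumptions made here for illustration. The quantity that replaces $h\eta$ is then:

```latex
% Assumed schedule: \eta_t(u) = \eta (t+u)^{-p}, t >= 1 (illustrative assumption)
\int_0^h \frac{\eta}{\sqrt{t+u}}\,du = 2\eta\left(\sqrt{t+h} - \sqrt{t}\right)
  \qquad (p = \tfrac{1}{2}),
\qquad
\int_0^h \frac{\eta}{t+u}\,du = \eta \log\frac{t+h}{t}
  \qquad (p = 1).
```

In both cases the integral grows more slowly than $h\eta$, which reflects that later presentations of the same example would have used a smaller learning rate.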

3.3 Regularization 

Theorem 1 can be modified to handle losses of the form $\ell(w^\top x, y) + \frac{\lambda}{2}\|w\|^2$ where $\|w\|$ is the Euclidean norm of $w$. However, the resulting differential equation is considerably harder and we have only been able to obtain a solution for the case of squared loss. Therefore we describe an alternative way of incorporating regularization based on a splitting approach [8]. First perform $h$ unconstrained steps using the closed form solution, then compute the effect of regularization:

$$w_{t+1} = \operatorname*{argmin}_{w} \frac{1}{2}\|w - (w_t - s(h)x)\|^2 + \frac{h\eta\lambda}{2}\|w\|^2.$$

Note that we apply all $h$ regularizers at once. The solution to the above optimization problem is

$$w_{t+1} = \frac{w_t - s(h)x_t}{1 + h\eta\lambda}.$$

This approach can also handle other regularizers such as $\lambda\|w\|_1$, leading to a truncated gradient update [16].
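The splitting step for the squared Euclidean regularizer amounts to a single shrinkage after the unconstrained update. A minimal sketch, with a hypothetical function name and with `s` assumed to be the already computed scaling $s(h)$:

```python
def regularized_invariant_step(w, x, s, h, eta, lam):
    """Apply w - s x, then shrink by 1 / (1 + h * eta * lam).

    This is the minimizer of 0.5*||w' - (w - s x)||^2 + 0.5*h*eta*lam*||w'||^2.
    """
    shrink = 1.0 / (1.0 + h * eta * lam)
    return [shrink * (wi - s * xi) for wi, xi in zip(w, x)]
```

Because the shrinkage is a scalar multiplication, an implementation could also keep a running global scale for the weight vector instead of touching every coordinate; that optimization is not shown here.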

4 PROPERTIES OF THE UPDATES 
4.1 Invariance 

First we show that the invariance property we men- 
tioned in the introduction holds. It is convenient to 
explicitly state the dependence on the prediction p, by 
writing s(p, h) instead of s{h). The following theorem 
states that an update with an importance weight a + b 
is equivalent to an update with importance a immedi- 
ately followed by an update with importance b. 



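Table 1: Importance Invariant and Imp$^2$ (cf. Section 5) Updates for Various Loss Functions

| Loss | $\ell(p, y)$ | Invariant update $s(h)$ | Imp$^2$ update |
|---|---|---|---|
| Squared | $\frac{1}{2}(y - p)^2$ | $\frac{p - y}{x^\top x}\left(1 - e^{-h\eta x^\top x}\right)$ | $\frac{h\eta(p - y)}{1 + h\eta x^\top x}$ |
| Logistic | $\log(1 + e^{-yp})$ | $\frac{W\left(e^{h\eta x^\top x + yp + e^{yp}}\right) - h\eta x^\top x - e^{yp}}{y\, x^\top x}$ for $y \in \{-1, 1\}$ | Not Closed |
| Exponential | $e^{-yp}$ | $\frac{py - \log\left(h\eta x^\top x + e^{py}\right)}{x^\top x\, y}$ for $y \in \{-1, 1\}$ | Not Closed |
| Logarithmic | $y \log\frac{y}{p} + (1 - y)\log\frac{1 - y}{1 - p}$ | if $y = 0$: $\frac{p - 1 + \sqrt{(p - 1)^2 + 2h\eta x^\top x}}{x^\top x}$; if $y = 1$: $\frac{p - \sqrt{p^2 + 2h\eta x^\top x}}{x^\top x}$ | Not Closed |
| Hellinger | $2\left(1 - \sqrt{py} - \sqrt{(1 - p)(1 - y)}\right)$ | if $y = 0$: $\frac{p - 1 + \frac{1}{4}\left(12h\eta x^\top x + 8(1 - p)^{3/2}\right)^{2/3}}{x^\top x}$; if $y = 1$: $\frac{p - \frac{1}{4}\left(12h\eta x^\top x + 8p^{3/2}\right)^{2/3}}{x^\top x}$ | Not Closed |
| Hinge | $\max(0, 1 - yp)$ | $-y \min\left(h\eta, \frac{1 - yp}{x^\top x}\right)$ for $y \in \{-1, 1\}$ | Same |
| $\tau$-Quantile | if $y > p$: $\tau(y - p)$; if $y \le p$: $(1 - \tau)(p - y)$ | if $y > p$: $-\tau \min\left(h\eta, \frac{y - p}{\tau x^\top x}\right)$; if $y \le p$: $(1 - \tau)\min\left(h\eta, \frac{p - y}{(1 - \tau) x^\top x}\right)$ | Same |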



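Theorem 2. Let $s(p, h)$ be the solution of

$$\frac{\partial s}{\partial h} = \eta \left.\frac{\partial \ell}{\partial p}\right|_{p = w^\top x - s(p, h)\, x^\top x}, \qquad s(p, 0) = 0$$

where $\ell$ is a continuously differentiable loss. Then

$$s(p, a + b) = s(p, a) + s(p - s(p, a)\, x^\top x, b).$$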

The proof is in the appendix and uses the Existence 
and Uniqueness Theorem for ODEs. 
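The invariance property can be checked numerically on the closed-form updates for squared and exponential loss, and contrasted with gradient multiplication. In this sketch the constants `ETA` and `N` (for $\eta$ and $x^\top x$) and all function names are illustrative assumptions.

```python
import math

ETA, N = 0.3, 2.0  # learning rate and x'x, fixed for the check


def s_squared(p, y, h):
    """Invariant scaling for squared loss."""
    return (p - y) / N * (1.0 - math.exp(-h * ETA * N))


def s_exponential(p, y, h):
    """Invariant scaling for exponential loss exp(-y p), y in {-1, +1}."""
    return (p * y - math.log(h * ETA * N + math.exp(p * y))) / (N * y)


def s_multiplied(p, y, h):
    """Gradient multiplication for squared loss, shown for contrast."""
    return h * ETA * (p - y)


def invariance_gap(s, p, y, a, b):
    """s(p, a+b) - [s(p, a) + s(p - s(p, a) * x'x, b)]; zero for invariant updates."""
    first = s(p, y, a)
    return s(p, y, a + b) - (first + s(p - first * N, y, b))
```

The gap vanishes up to rounding for the two invariant updates, while for gradient multiplication on squared loss it equals $ab\,\eta^2 x^\top x\,(p - y)$, which grows with both importance weights.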

4.2 Safety 

For some loss functions such as squared loss, hinge loss and quantile loss the residual $w_t^\top x_t - y_t$ tells us whether the learner is overestimating or underestimating the target. We call an update safe if

$$\frac{w_{t+1}^\top x_t - y_t}{w_t^\top x_t - y_t} \ge 0$$

whenever $(w_t^\top x_t - y_t) \ne 0$. Since the residual does not change sign after a safe update, it leads to sane results even when the learning rate is very aggressive.

Standard gradient descent is not safe, and importance invariant step sizes always are. This should be obvious for the hinge loss and the quantile loss because they use the minimum necessary step. For squared loss:

$$\frac{w_{t+1}^\top x - y}{w_t^\top x - y} = \frac{w_t^\top x - \frac{w_t^\top x - y}{x^\top x}\left(1 - e^{-h\eta x^\top x}\right) x^\top x - y}{w_t^\top x - y} = e^{-h\eta x^\top x} \ge 0,$$

hence the update is safe.
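The contrast between the two updates for squared loss reduces to two residual ratios. The sketch below computes both; the function names are hypothetical and the ratio for gradient multiplication follows from $w_{t+1}^\top x - y = (w_t^\top x - y)(1 - h\eta x^\top x)$.

```python
import math


def residual_ratio_multiplied(h, eta, x_dot_x):
    """(w_{t+1}'x - y) / (w_t'x - y) when the squared-loss gradient is multiplied by h."""
    return 1.0 - h * eta * x_dot_x


def residual_ratio_invariant(h, eta, x_dot_x):
    """(w_{t+1}'x - y) / (w_t'x - y) for the importance invariant update."""
    return math.exp(-h * eta * x_dot_x)
```

As soon as $h\eta x^\top x > 1$ the multiplied update flips the sign of the residual, and beyond $h\eta x^\top x > 2$ the residual grows in magnitude, whereas the invariant ratio stays in $(0, 1]$ for every $h \ge 0$.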



4.3 Fallback Regret Analysis 

Here we provide a fallback analysis for the case $h_t = 1$. For simplicity we only show the results for squared loss and $\|x_t\| = 1$ for all $t$. However, this can be extended to other losses, as a Taylor expansion of each update around $\eta = 0$ shows that, to first order, it is equivalent to online gradient descent. Hence we expect a regret analysis similar to the one achieved by the underlying learning rate schedule. A proof of the following theorem is in the appendix.

Theorem 3. If $\ell(p, y) = \frac{1}{2}(p - y)^2$ and $\|x_t\| = 1$ for all $t$ then the importance invariant update attains a regret of $O(\sqrt{T})$ when $\eta_t = \frac{1}{\sqrt{t}}$ and a regret of $O(\log(T))$ when $\eta_t = \frac{1}{t}$.

5 IMPLICIT IMPORTANCE 
WEIGHT UPDATES 

Implicit updates, first proposed in [13] and recently analyzed in [14], provide an alternative way for handling importance weights. An implicit update sets:

$$w_{t+1} = \operatorname*{argmin}_{w} \frac{1}{2}\|w - w_t\|^2 + \lambda\, \ell(w^\top x, y)$$

where $\lambda$ is a free parameter similar to the learning rate in interpretation. Finding the minimizing $w$ generally requires an iterative root-finding algorithm which is perhaps an order of magnitude more expensive than the closed form updates we derive above, although often easily amortized by the update itself. For squared loss and hinge loss closed-form solutions have been known. In the appendix we show how to derive these as well as a closed form implicit update for quantile loss which, to the best of our knowledge, is new. To adapt implicit updates for importance weights we simply use $\lambda_t = \eta h_t$ (or $\lambda_t = \int_0^{h_t} \eta_t(u)\,du$ as in Section 3.2), yielding an algorithm we call Imp$^2$. Imp$^2$ has qualitatively similar properties, satisfying Safety and Regret, but not Invariance or a Closed form update. In fact, Imp$^2$ for hinge and quantile is precisely equivalent to the importance invariant update.

Figure 1: Test error vs. fraction of queried labels for each dataset

6 EXPERIMENTS 

We present empirical results on four text classification datasets: 'rcv1' is a modified version [15] of RCV1 [17], 'astro' is from [H], 'spam' was created from the TREC 2005 spam public corpora, and 'webspam' is from the PASCAL large scale learning challenge. In all
experiments we did a single pass through the training 
set and we report the error on the test set. We try all 
learning rate schedules of the form rjt = 
with {fi,T,p) G {2»}i£o X {lOnto X {0-5, iV 

In the first set of experiments, large importance 
weights will inevitably appear. In particular, we treat 
all four datasets as active learning tasks and apply 
the algorithm in ^ . At each timestep t this algorithm 
computes a probability of querying the label of a point based on a quantity $G_t$ that measures the difference in error rates between two hypotheses: the empirical risk minimizing (ERM) hypothesis and an alternative hypothesis that minimizes the empirical risk subject to predicting a different label than the ERM hypothesis. In our case the hypothesis we are learning with online gradient descent acts as the ERM hypothesis. To get a handle on $G_t$ we estimate $t \cdot G_t$ by the importance weight that the example would need to have in order for an update with the alternative hypothesis's preferred label to cause the classification of the example to become the alternative label. In the appendix we derive the relevant importance weights for invariant and implicit updates. Once we have an estimate for $G_t$ we can compute the query probability $P_t$ and, if we query the label, we add the example to our dataset with importance weight $1/P_t$ to keep things unbiased. The query probabilities turn out to be proportional to the learning rate $\eta_t$ and hence the algorithm will generate importance weights of order $O(\eta_t^{-1})$, growing at least as fast as $\Omega(\sqrt{t})$. Notice that importance weights are used for two different tasks: estimating $G_t$ and preserving unbiasedness. Our linear model (with a link function $\sigma(p) = \max(0, \min(1, p))$) is optimizing squared loss. Details are in the appendix.
Figure 2: Scatter plot showing test accuracy with two different updates for various datasets and losses



In Figure 1 we summarize the results of the active
learning experiments. Each combination of learning 
rate schedule and setting of the parameter Cq in the 
active learning algorithm (Cq € {10~*, 10"'', . . . , 10^}) 
is an experiment that can be represented in the graph 
by a point whose x-coordinate is the fraction of labels 
queried by the active learning algorithm and whose 
y-coordinate is the test error of the learned hypothe- 
sis. To summarize this set of points, the figures plot 
part of its convex hull. The points on the convex 
hull (sometimes called a Pareto frontier) are experi- 
ments which represent optimal tradeoffs between gen- 
eralization and label complexity, for some setting of 
this tradeoff. When a curve stops sooner than the size 
of the dataset it means that there were no experiments 
in which using more queries gave better generalization. 
We have also included the results from a typical good 
run of a passive learner. The graphs show the value of having an update that handles importance weights correctly: doing so yields better generalization and lower label complexity than those attainable by multiplying the gradient with the importance weight. In fact, Table 2, which lists the ratio of labels between passive and active learning to achieve a given accuracy, shows that linearization can make active learning need more labels than passive learning.

In the second set of experiments we used the same datasets but treated all examples as having an importance weight of one. We compare standard online gradient descent with invariant and implicit updates on four loss functions: squared, logistic, hinge and quantile ($\tau = 0.5$) loss. The purpose of these experiments is to highlight the robustness of the invariant updates: they yield good generalization with little search for a good learning rate schedule (also noted in [13] for implicit updates). However, we begin with Table 3, which shows
the test accuracy of the hypotheses learned by each up- 



Table 2: Reduction in label complexity

| Dataset | astro | rcv1 | spam | web |
|---|---|---|---|---|
| Desired Accuracy | 0.963 | 0.943 | 0.967 | 0.986 |
| Multiplication | 0.45 | 1.59 | 1.28 | 0.54 |
| Implicit | 5.12 | 6.55 | 1.88 | 4.33 |
| Invariant | 7.56 | 6.55 | 2.13 | 1.82 |



Table 3: Test accuracies (grid search over schedules)

| Dataset | Loss | Invariant | Implicit | Standard |
|---|---|---|---|---|
| astro | hinge | u.yoozu | Same | |
| | logistic | 0.96494 | 0.96485 | 0.96432 |
| | quantile | 0.96629 | Same | 0.96703 |
| | squared | 0.96463 | 0.96429 | 0.96469 |
| rcv1 | hinge | 0.94872 | Same | 0.94838 |
| | logistic | 0.94704 | 0.94682 | 0.94743 |
| | quantile | 0.94846 | Same | 0.94859 |
| | squared | 0.94769 | 0.94799 | 0.94790 |
| spam | hinge | 0.97626 | Same | 0.97411 |
| | logistic | 0.96676 | 0.97982 | 0.97603 |
| | quantile | 0.97524 | Same | 0.97484 |
| | squared | 0.97609 | 0.97614 | 0.97563 |
| web | hinge | 0.98936 | Same | 0.99142 |
| | logistic | 0.99094 | 0.99038 | 0.9923 |
| | quantile | 0.98908 | Same | 0.99088 |
| | squared | 0.98960 | 0.98966 | 0.99218 |



date after exhaustively searching over the learning rate 
schedule. For astro and rcv1 the differences are very
small. The spam dataset is not TF-IDF processed, 
and there we see a substantial improvement with ei- 
ther new update. The results on the webspam dataset 
were initially puzzling, but we have verified that this 
is not a failure to optimize well; on the contrary the 
proposed updates attain smaller progressive validation 
loss ^ than standard online gradient descent on the 
training data. Since progressive validation loss deviates like a test set, this is evidence that the webspam test set has a different distribution from the training set.¹

To illustrate robustness we present the results in two 
ways. First, in Figure [2] we show a scatterplot where 
each point is a learning rate schedule and its coordi- 
nates are the accuracy of the learned hypothesis with 
and without the proposed updates. Scatter plots for 
other loss functions and datasets look very similar and 
are included in the appendix. The plot only shows the 
cases when both learning rates achieve accuracy above 
0.9 and there are virtually no schedules for which one 
update is superior by more than 0.1. Among these 
cases the vast majority of experiments are clustered 
under the y = x line and towards the extreme values 



¹The test set consists of the last 50000 examples of the original training set. The real test labels are not public.



Table 4: Fraction of schedules with near optimal error

| Loss | Invariant | Standard |
|---|---|---|
| hinge | 0.337 | 0.039 |
| logistic | 0.109 | 0.050 |
| quantile | 0.361 | 0.053 |
| squared | 0.306 | 0.031 |



of the x-axis. Consequently, when using the importance invariant update many schedules provide excellent performance.

To make this clearer we present a second way of viewing this result. In Table 4 we report the fraction of learning rate schedules that achieve generalization accuracy within 0.001 of the best learning rate schedule, on average across all four datasets. For loss functions that have a notion of overshooting, and can therefore benefit from a safe update, we observe an order of magnitude improvement in the number of schedules that converge to near optimal performance.

7 CONCLUSIONS 

We tuned online gradient descent learning algorithms 
for various losses so they efficiently incorporate im- 
portance weight information, as is needed for appli- 
cations in boosting, active learning, transfer learning, 
and learning reductions. The essential lesson here is 
that taking into account the curvature of the loss func- 
tion can be done cheaply and provides great benefits 
in dealing with importance weights. 

Motivated by an invariance property we proposed new 
updates that improve the standard update rule even 
for the baseline importance weight 1 case, yielding bet- 
ter prediction performance while simultaneously re- 
ducing the value of learning rate parameter search. 
Experiments not reported here show that it even im- 
proves the performance of adaptive gradient descent 
methods such as [H [7] • Since this tuned update rule is 
computationally "free" we expect wide use. 
APPENDIX

A PROOF OF MAIN THEOREM 

In this section we prove Theorem 1.

Theorem. The limit of the gradient descent process as the learning rate becomes infinitesimal for an example with importance weight $h \in \mathbb{R}_+$ is equal to the update

$$w_{t+1} = w_t - s(h)x$$

where the scaling factor $s(h)$ satisfies the differential equation:

$$s'(h) = \eta \left.\frac{\partial \ell}{\partial p}\right|_{p = (w_t - s(h)x)^\top x}, \qquad s(0) = 0. \qquad (8)$$



Proof. In accordance with the proof of Lemma 1 we can compute the effect of an importance weight of $h + \epsilon$, assuming we know the effect of an importance weight of $h$, by performing an additional gradient step with the learning rate appropriately scaled by $\epsilon$:

$$s(h + \epsilon) = s(h) + \epsilon \eta \left.\frac{\partial \ell}{\partial p}\right|_{p = (w_t - s(h)x)^\top x}.$$

Rearranging we have

$$\frac{s(h + \epsilon) - s(h)}{\epsilon} = \eta \left.\frac{\partial \ell}{\partial p}\right|_{p = (w_t - s(h)x)^\top x}.$$

Taking the limit as $\epsilon \to 0$ gives the result. The initial condition makes sure that a zero importance weight has no effect in the update. □



B PROOF OF INVARIANCE 



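In this section we prove Theorem 2. For convenience let $f(z) = \eta \left.\frac{\partial \ell}{\partial p}\right|_{p = z}$ and $n = x^\top x$. The theorem says that if $f$ is continuous and $s$ satisfies

$$\frac{\partial s}{\partial h} = f(p - n\,s(p, h)), \qquad s(p, 0) = 0$$

then

$$s(p, a + b) = s(p, a) + s(p - n\,s(p, a), b). \qquad (9)$$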



Proof. Fix an arbitrary value of $a$, and consider the pair of single-variable functions

$$u(b) = s(p, a + b)$$
$$v(b) = s(p, a) + s(p - n\,s(p, a), b).$$

These satisfy

$$u(0) = v(0) = s(p, a)$$

and they satisfy

$$\frac{du}{db} = \left.\frac{\partial s}{\partial h}\right|_{(p,\, a+b)} = f(p - n\,s(p, a + b)) = f(p - n\,u(b))$$
$$\frac{dv}{db} = \left.\frac{\partial s}{\partial h}\right|_{(p - n s(p, a),\, b)} = f(p - n\,s(p, a) - n\,s(p - n\,s(p, a), b)) = f(p - n\,v(b)).$$

In other words, both $u$ and $v$ are solutions of the ordinary differential equation

$$\frac{dw}{db} = f(p - n\,w(b))$$

with initial condition $w(0) = s(p, a)$. Since $f$ satisfies the hypotheses of the existence and uniqueness theorem for ordinary differential equations, it is valid to conclude that $u(b) = v(b)$, which verifies (9). □

C FALLBACK REGRET PROOF 

In this section we prove Theorem 3.

Theorem. If $\ell(p, y) = \frac{1}{2}(p - y)^2$ and $\|x_t\| = 1$ for all $t$ then the importance invariant update attains a regret of $O(\sqrt{T})$ when $\eta_t = \frac{1}{\sqrt{t}}$ and a regret of $O(\log(T))$ when $\eta_t = \frac{1}{t}$.

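Proof. (Sketch) First, consider the case when $\eta_t = \frac{1}{\sqrt{t}}$ and therefore the step sizes are $\eta'_t = 1 - \exp(-\frac{1}{\sqrt{t}})$. In general the regret of online gradient descent depends on the step sizes via [5T]

$$\frac{1}{\eta'_T} + \sum_{t=1}^{T} \eta'_t = \frac{1}{1 - \exp(-\frac{1}{\sqrt{T}})} + \sum_{t=1}^{T}\left(1 - \exp\left(-\tfrac{1}{\sqrt{t}}\right)\right).$$

The Taylor expansion of $\frac{1}{1 - \exp(-1/\sqrt{T})}$ around zero suggests that the first term grows asymptotically as $\sqrt{T} + \frac{1}{2}$. The second term can be bounded by $1 + \int_1^T \left(1 - \exp(-\frac{1}{\sqrt{t}})\right) dt$, which evaluates to

$$\sqrt{T}\exp\left(-\tfrac{1}{\sqrt{T}}\right) + T\left(1 - \exp\left(-\tfrac{1}{\sqrt{T}}\right)\right) - \Gamma\left(0, \tfrac{1}{\sqrt{T}}\right) + \Gamma(0, 1)$$

where $\Gamma(a, x) = \int_x^\infty u^{a-1} e^{-u}\,du$ is the incomplete gamma function, with the property $\Gamma(0, \frac{1}{u}) = O(\log(u))$ for large $u$. The first two terms are both $O(\sqrt{T})$, the third term is $O(\log(\sqrt{T}))$ and the last term is constant. Hence the regret is $O(\sqrt{T})$, as is the regret we would attain with $\eta_t = \frac{1}{\sqrt{t}}$.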

Now, we consider the case $\eta_t = O(1/t)$. This type of schedule is known to give logarithmic regret for strongly convex functions such as squared loss. The same holds true for our update rule. In general, for a 1-strongly convex loss such as squared loss, regret depends on [12]

$$\sum_{t=1}^{T} \left(\frac{1}{\eta'_t} - \frac{1}{\eta'_{t-1}} - 1\right) + \sum_{t=1}^{T} \eta'_t, \qquad \eta'_t = 1 - e^{-1/t}.$$

By computing the Taylor series of the first term, we see that each individual summand is $O(1/t^2)$ and therefore the first sum is bounded by a constant. As before, the second sum can be bounded by

$$1 + T\left(1 - e^{-1/T}\right) + \Gamma(0, 1/T).$$

The second term is less than 1 and the third term is $O(\log(T))$, which is the regret attained by the original $\eta_t = 1/t$ update. □
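The asymptotic claims in the sketch above are easy to sanity-check numerically. The snippet below is an illustrative check under the theorem's assumptions (squared loss, $\|x_t\| = 1$), not part of the proof. The horizon `T` and the helper name `effective_step` are arbitrary choices made here.

```python
import math

def effective_step(eta):
    """Step size 1 - exp(-eta) of the invariant update for squared loss with ||x|| = 1."""
    return -math.expm1(-eta)  # expm1 avoids cancellation for small eta

T = 10_000  # arbitrary horizon for the check

# Schedule eta_t = 1/sqrt(t): the two terms of the O(sqrt(T)) bound.
first_term = 1.0 / effective_step(1.0 / math.sqrt(T))
second_term = sum(effective_step(1.0 / math.sqrt(t)) for t in range(1, T + 1))

# Schedule eta_t = 1/t: summands 1/eta'_t - 1/eta'_{t-1} - 1 for t >= 2,
# rescaled by t^2 to expose their O(1/t^2) decay.
scaled = [
    t * t * abs(1.0 / effective_step(1.0 / t)
                - 1.0 / effective_step(1.0 / (t - 1)) - 1.0)
    for t in range(2, 1001)
]
log_term = sum(effective_step(1.0 / t) for t in range(1, T + 1))

print(first_term - math.sqrt(T))    # close to 1/2
print(second_term / math.sqrt(T))   # stays below 2
print(max(scaled))                  # bounded, so the first sum converges
print(log_term / math.log(T))       # ratio to log(T)
```

Since $1 - e^{-x} \le x$, the effective step sizes never exceed the nominal ones, which is why both sums inherit the bounds of the corresponding plain gradient descent schedules.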

D EXPLICIT IMPLICIT UPDATES 

Here we derive closed form updates for Imp^ for some 
losses. The results on squared loss and hinge loss are 
known, but the result for quantile loss is new.

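D.1 Squared Loss

For the squared loss we have

$$w_{t+1} = \arg\min_w \frac{1}{2}\|w - w_t\|^2 + \frac{\lambda}{2}(w^\top x - y)^2.$$

Taking the derivative of the objective with respect to $w$ and setting to zero we get

$$w = w_t - \lambda(w^\top x - y)x. \tag{10}$$

We can solve for $w^\top x$ by taking the inner product of both sides with $x$:

$$w^\top x = w_t^\top x - \lambda(w^\top x - y)x^\top x \quad\Longrightarrow\quad w^\top x = \frac{w_t^\top x + \lambda y\,x^\top x}{1 + \lambda x^\top x}.$$

Now we substitute back in (10) and we get

$$w = w_t - \frac{\lambda(w_t^\top x - y)}{1 + \lambda x^\top x}\,x.$$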
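As a quick illustration of the closed form just derived, the sketch below implements the implicit squared-loss update on plain Python lists and checks that the result satisfies the stationarity condition (10). The function name `implicit_squared_update` and the numeric inputs are made up for the example.

```python
def implicit_squared_update(w_t, x, y, lam):
    """Closed-form minimizer of 0.5*||w - w_t||^2 + 0.5*lam*(w.x - y)^2."""
    p_t = sum(wi * xi for wi, xi in zip(w_t, x))   # w_t^T x
    n = sum(xi * xi for xi in x)                   # x^T x
    step = lam * (p_t - y) / (1.0 + lam * n)
    return [wi - step * xi for wi, xi in zip(w_t, x)]

# Made-up inputs for the check.
w_t, x, y, lam = [0.5, -1.0, 2.0], [1.0, 2.0, -1.0], 1.0, 0.7
w = implicit_squared_update(w_t, x, y, lam)

# Residual of the stationarity condition (10): w - w_t + lam*(w.x - y)*x = 0.
p = sum(wi * xi for wi, xi in zip(w, x))
residual = [wi - wti + lam * (p - y) * xi for wi, wti, xi in zip(w, w_t, x)]
print(residual)
```

Because the step is divided by $1 + \lambda x^\top x$, the new prediction never overshoots the label, however large $\lambda$ is.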

D.2 Hinge Loss 

The implicit update for the hinge loss comes from the solution of

$$w_{t+1} = \arg\min_w \frac{1}{2}\|w - w_t\|^2 + \lambda\xi \quad \text{s.t.} \quad \xi \ge 0, \quad \xi \ge 1 - y\,w^\top x.$$

This update has been considered by [5] under the name PA-I. They show that the solution to the above optimization problem is

$$w_{t+1} = w_t + y\,\min\left(\lambda, \frac{1 - y\,w_t^\top x}{x^\top x}\right)x$$

which is the same as our importance invariant update. However, they treat $\lambda$ as a fixed hyperparameter, while here $\lambda = h_t\eta_t$ is varying as $t$ varies.
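A minimal sketch of this update with $\lambda = h_t\eta_t$ follows. The clamp `max(0.0, ...)` is an assumption added here so that examples already classified with margin at least 1 are left unchanged; the displayed formula omits it. The function name and inputs are hypothetical.

```python
def pa1_hinge_update(w_t, x, y, lam):
    """w_t + y * min(lam, (1 - y w_t.x) / x.x) * x, with the step clamped at zero (assumption)."""
    margin = y * sum(wi * xi for wi, xi in zip(w_t, x))
    n = sum(xi * xi for xi in x)
    tau = min(lam, max(0.0, 1.0 - margin) / n)
    return [wi + y * tau * xi for wi, xi in zip(w_t, x)]

w0, x, y = [0.0, 0.0], [1.0, 1.0], 1.0

# lam = h * eta: a small product takes a plain step of size lam,
# a large one stops exactly at margin 1.
print(pa1_hinge_update(w0, x, y, 0.1))
print(pa1_hinge_update(w0, x, y, 10.0))

# Two updates with lam = 0.2 versus one update with lam = 0.4.
twice = pa1_hinge_update(pa1_hinge_update(w0, x, y, 0.2), x, y, 0.2)
once = pa1_hinge_update(w0, x, y, 0.4)
print(twice, once)
```

The last comparison illustrates the invariance property on this example: splitting the importance weight across two updates leaves the final weights unchanged.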

D.3 Quantile Loss 

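For the quantile loss, we can write the update as:

$$w_{t+1} = \arg\min_w \frac{1}{2}\|w - w_t\|^2 + \lambda\xi \quad \text{s.t.} \quad \xi \ge \tau(y - w^\top x), \quad \xi \ge (1-\tau)(w^\top x - y).$$

To find $w_{t+1}$ we introduce the Lagrangian:

$$L(w, \xi, \mu) = \frac{1}{2}\|w - w_t\|^2 + \lambda\xi + \mu_1\left(\tau(y - w^\top x) - \xi\right) + \mu_2\left((1-\tau)(w^\top x - y) - \xi\right)$$

with Lagrange multipliers $\mu_1 \ge 0$, $\mu_2 \ge 0$. Setting the derivative of $L$ to zero w.r.t. $w$ and $\xi$ we get

$$w = w_t + \mu_1\tau x - \mu_2(1-\tau)x \tag{11}$$

$$\mu_1 + \mu_2 = \lambda \tag{12}$$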
We now distinguish the following cases:

1. $w^\top x > y$. This means that $(1-\tau)(w^\top x - y) > 0$ and therefore the second constraint suggests $\xi > 0$. This implies that $\tau(y - w^\top x) - \xi < 0$, in other words the first constraint is not tight. Using the KKT complementary slackness condition $\mu_1(\tau(y - w^\top x) - \xi) = 0$ we conclude that in this case $\mu_1 = 0$ and hence $\mu_2 = \lambda$. Hence (11) becomes $w = w_t - \lambda(1-\tau)x$. Using this we can write the condition $w^\top x > y$ in terms of the a priori known $w_t$ as: $\frac{w_t^\top x - y}{(1-\tau)x^\top x} > \lambda$.

2. $w^\top x < y$. In a similar manner to the previous case, we can show that $\mu_2 = 0$ using the complementary slackness condition $\mu_2((1-\tau)(w^\top x - y) - \xi) = 0$. Therefore $\mu_1 = \lambda$ and $w = w_t + \lambda\tau x$. Again the condition ($w^\top x < y$) can be written in terms of $w_t$ as: $\frac{y - w_t^\top x}{\tau x^\top x} > \lambda$.

3. $w^\top x = y$. Plugging in $w$ from (11) and rearranging the terms we have

$$w_t^\top x + \left(\tau(\mu_1 + \mu_2) - \mu_2\right)x^\top x = y.$$

Now we use (12) to get an equation involving only $\mu_2$ which yields $\mu_2 = \frac{w_t^\top x - y}{x^\top x} + \tau\lambda$ and hence $\mu_1 = \lambda - \mu_2 = \frac{y - w_t^\top x}{x^\top x} + (1-\tau)\lambda$. Plugging these back to the condition on $w$ from the Lagrangian we get

$$w = w_t - \frac{w_t^\top x - y}{x^\top x}\,x.$$
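The three cases above translate directly into a small branch on the a priori known prediction $w_t^\top x$. The following sketch assumes $0 < \tau < 1$ and $x \neq 0$, and uses hypothetical names and made-up inputs.

```python
def implicit_quantile_update(w_t, x, y, lam, tau):
    """Implicit update for the quantile loss, following the three KKT cases."""
    p_t = sum(wi * xi for wi, xi in zip(w_t, x))   # w_t^T x
    n = sum(xi * xi for xi in x)                   # x^T x
    if p_t - y > lam * (1.0 - tau) * n:
        step = -lam * (1.0 - tau)   # case 1: prediction stays above y
    elif y - p_t > lam * tau * n:
        step = lam * tau            # case 2: prediction stays below y
    else:
        step = (y - p_t) / n        # case 3: prediction lands exactly on y
    return [wi + step * xi for wi, xi in zip(w_t, x)]

x, y, tau = [1.0, 0.0], 1.0, 0.25
print(implicit_quantile_update([0.0, 0.0], x, y, 2.0, tau))  # case 2
print(implicit_quantile_update([0.0, 0.0], x, y, 8.0, tau))  # case 3
print(implicit_quantile_update([3.0, 0.0], x, y, 2.0, tau))  # case 1
```

As with the hinge loss, a large enough $\lambda$ moves the prediction exactly onto the label and no further.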



To sum up, the implicit update for the quantile loss is

$$w_{t+1} = \begin{cases} w_t - \lambda(1-\tau)x, & \frac{w_t^\top x - y}{(1-\tau)x^\top x} > \lambda \\ w_t + \lambda\tau x, & \frac{y - w_t^\top x}{\tau x^\top x} > \lambda \\ w_t - \frac{w_t^\top x - y}{x^\top x}\,x, & \text{otherwise} \end{cases}$$

which is the same as the importance invariant update.

E MORE INVARIANT UPDATES

In this section we just list the updates for prediction under the logarithmic loss $\ell(\sigma, y) = y\log\frac{1}{\sigma} + (1-y)\log\frac{1}{1-\sigma}$ using the link function $\sigma = \frac{1+\tanh(p)}{2}$ (as usual $p = w^\top x$). As in the case without the link function, we get an explicit form for $y \in \{0, 1\}$ which is

$$s(h) = \begin{cases} \frac{1}{x^\top x}\left(p + \tanh^{-1}\frac{\omega_0 - 1}{\omega_0 + 1}\right), & \text{if } y = 0 \\ \frac{1}{x^\top x}\left(p - \tanh^{-1}\frac{\omega_1 - 1}{\omega_1 + 1}\right), & \text{if } y = 1 \end{cases}$$

where

$$\omega_1 = W\left(\exp\left(e^{2p} + 2p + 4h\eta\,x^\top x\right)\right), \qquad \omega_0 = W\left(\exp\left(e^{-2p} - 2p + 4h\eta\,x^\top x\right)\right).$$

Recall that $W(z)$ is the Lambert W function.

F ACTIVE LEARNING

The active learning algorithm of [2] proceeds in $T$ rounds and in each round $t$ it considers example $x_t$. It decides to query the label of $x_t$ by first computing a rejection threshold based on $G_t$, the difference in (importance weighted) errors of the ERM hypothesis and an alternative hypothesis. The latter minimizes the empirical risk subject to predicting a different label $y_a$ than that predicted by the ERM hypothesis on the current example. To get a handle on $G_t$ we estimate $t \cdot G_t$ by the minimum importance weight that $x_t$ would need to have if $x_t$'s label were $y_a$ so that the update caused the classification of the example to become $y_a$.

More specifically, for binary problems with $y \in \{-1, 1\}$ assume $y = \mathrm{sign}(w^\top x)$ and hence the alternative label is $y_a = -y$. Then we want an importance weight $h$ such that $(w - s(h)x)^\top x = 0$ where $s(h)$ is computed using $y_a$ in place of $y$. Therefore we get that

$$s(h) = \frac{w^\top x}{x^\top x}.$$

Notice that $s$ doesn't even need to be in closed form to compute $h$. The reason is that the ODE (7) yields a solution directly in terms of $h$: since $s$ satisfies (7), by separating variables, integrating both sides and making use of the initial condition we get

$$h = \frac{1}{\eta}\int_0^{\frac{w_t^\top x_t}{\|x_t\|^2}} \frac{du}{\left.\frac{\partial \ell(p, y_a)}{\partial p}\right|_{p = w_t^\top x_t - u\|x_t\|^2}}. \tag{13}$$

For implicit updates we do not have a general methodology but for each loss we have always been able to find a closed form solution for the importance. Below we show how to do this in the case of logistic loss.

For the logistic loss the optimization problem that defines the implicit update is

$$w_{t+1} = \arg\min_w \frac{1}{2}\|w - w_t\|^2 + h\eta_t\log\left(1 + \exp(-y_a w^\top x_t)\right).$$

Setting the derivative of the above to zero leads to

$$w_{t+1} = w_t + \frac{h\eta_t y_a}{1 + \exp(y_a w_{t+1}^\top x_t)}\,x_t.$$

The update is implicit because $w_{t+1}$ is given in terms of itself. In any case, to find the relevant importance weight we want $w_{t+1}^\top x_t = 0$ and by taking the inner product of the above equation with $x_t$ and plugging $w_{t+1}^\top x_t = 0$ in we get

$$-w_t^\top x_t = \frac{h\eta_t y_a}{2}\,x_t^\top x_t \quad\Longrightarrow\quad h = -\frac{2\,w_t^\top x_t}{\eta_t y_a\,x_t^\top x_t}.$$
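The two ways of obtaining the label-flipping importance weight for the logistic loss can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the invariant-update weight is computed by midpoint-rule quadrature of the separated ODE, $h = \frac{1}{\eta}\int_0^{p/n} du \,/\, \ell'(p - un)$ with $p = w_t^\top x_t$ and $n = \|x_t\|^2$, and the implicit-update weight uses the closed form derived above. Function names and the numeric inputs are made up.

```python
import math

def dloss_dp(p, y):
    """Derivative in p of the logistic loss log(1 + exp(-y p)), y in {-1, +1}."""
    return -y / (1.0 + math.exp(y * p))

def flip_weight_invariant(p, n, eta, steps=10_000):
    """Midpoint-rule quadrature of h = (1/eta) * int_0^{p/n} du / l'(p - u*n)."""
    y_a = -1.0 if p > 0 else 1.0    # alternative label -sign(p)
    du = (p / n) / steps            # integrate u from 0 to p / n
    integral = sum(du / dloss_dp(p - (k + 0.5) * du * n, y_a)
                   for k in range(steps))
    return integral / eta

def flip_weight_implicit(p, n, eta):
    """h = -2 p / (eta * y_a * n) for the implicit logistic update."""
    y_a = -1.0 if p > 0 else 1.0
    return -2.0 * p / (eta * y_a * n)

p, n, eta = 1.2, 2.0, 0.5   # made-up prediction w^T x, squared norm and learning rate
h_inv = flip_weight_invariant(p, n, eta)
h_imp = flip_weight_implicit(p, n, eta)
print(h_inv, h_imp)
```

For the logistic loss the integral can also be done by hand, giving $h = (|p| + 1 - e^{-|p|})/(\eta n)$, which the quadrature reproduces. Both weights grow with the margin $|p|$, so confidently classified examples need a larger importance weight before their predicted label flips.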



G SCATTERPLOTS 



In Figure 3 we show a few more scatterplots. The trend is the same as in Figure 2.



Figure 3: Scatter plot showing test accuracy with two different updates for various datasets and losses. Panels: rcv1 (squared loss), spam (quantile loss), webspam (hinge loss); the horizontal axis of each panel is labeled "invariant".



